Github link: https://github.com/r-public-health-group/monkey-pox
Team roles and responsibilities: https://github.com/r-public-health-group/monkey-pox/blob/main/team_roles_and_responsibilities.docx
We are interested in how monkeypox case rates differ by European region and population demographics. We are first utilizing a monkeypox case dataset from the European Centre for Disease Prevention and Control that spans 2022-05-09 to 2022-08-23 and includes the daily number of confirmed monkey pox cases by country and how the data was collected. Then, we will incorporate a European population dataset from , which includes European countries’ yearly population data from 2011 to 2022 based on the total population residing in that country as of January 1st of each year. Finally, we will incorporate the European Statistical System’s 2011 European census data which includes the EU country codes, sex at birth, age ranges, employment status (CAS), education status, population of locality of residence and the number of people in each strata.
As we prepare our report on monkeypox case rates to the state health department, the monkeypox case data will provide us with the number of cases that will be the numerator of our case rate calculation, the population data gives us the size of the population at risk that will be the denominator of our case rate calculation, and the census data will allow us to stratify the case rates by various demographic factors.
All three of our CSVs are hosted on the course github page: https://github.com/PHW290/phw251_projectdata. We navigated to our respective csvs, clicked on raw data to get the standalone csv, and saved it into a url variable in this Rmd. We then used the read_csv function from the readr library to import our data into rstudio.
mpx_cases_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_mpx_cases.csv"
mpx_cases_df <- read_csv(mpx_cases_url)
pop_denoms_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_pop_denominators.csv"
pop_denoms_df = read_csv(pop_denoms_url)
census_stats_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_census_stats.csv"
census_stats_df <- read_csv(census_stats_url,
col_names = TRUE,
col_types = NULL,
na = c("", "NA"))
We are now lowercasing all the column names.
mpx_cases_df <- rename_with(mpx_cases_df, ~tolower(.x))
pop_denoms_df <- rename_with(pop_denoms_df, ~tolower(.x))
census_stats_df <- rename_with(census_stats_df, ~tolower(.x))
Details of key data elements are outlined below. All elements listed are in an appropriate data format for the joins and analysis we intend to do.
str(mpx_cases_df$daterep)
## Date[1:2987], format: "2022-05-09" "2022-05-09" "2022-05-09" "2022-05-09" "2022-05-09" ...
min(mpx_cases_df$daterep)
## [1] "2022-05-09"
max(mpx_cases_df$daterep)
## [1] "2022-08-23"
unique(dplyr::count(mpx_cases_df, daterep)$n)
## [1] 29
The dates are in date format, they span from 2022-05-09 to 2022-08-23, and each date shows up 29 times.
typeof(mpx_cases_df$countrycode)
## [1] "character"
unique(mpx_cases_df$countrycode)
## [1] "AT" "BE" "BG" "HR" "CY" "CZ" "DK" "EE" "FI" "FR" "DE" "EL" "HU" "IS" "IE"
## [16] "IT" "LV" "LT" "LU" "MT" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE"
length(unique(mpx_cases_df$countrycode))
## [1] 29
The variable countryexp is the country name where countrycode is it’s two letter counterpart. Both are in character format. There are 29 countries respresented.
typeof(mpx_cases_df$source)
## [1] "character"
unique(mpx_cases_df$source)
## [1] "TESSy" "EI"
Source is in a character format. There are two sources of data: “TESSy” and “EI.”
typeof(mpx_cases_df$confcases)
## [1] "double"
unique(mpx_cases_df$confcases)
## [1] 0 1 4 2 3 6 5 20 9 11 82 21 14 13 7 10 8 16
## [19] 22 26 12 19 15 27 30 33 47 25 31 18 41 34 40 38 61 43
## [37] 42 72 46 48 66 70 55 56 95 97 53 23 90 76 89 86 24 28
## [55] 67 17 101 110 74 93 83 58 106 71 130 151 79 68 109 144 184 69
## [73] 29 176 191 78 158 102 199 112 37 131 44 113 155 160 140 655 136 170
## [91] 60 172 45 159 388 63 139 114 137 284 147 51 77 121 124 128 178 91
## [109] 73 65
min(mpx_cases_df$confcases)
## [1] 0
mean(mpx_cases_df$confcases)
## [1] 5.715434
max(mpx_cases_df$confcases)
## [1] 655
The confirmed cases are in a numeric data format and range from 0 to 655. The mean is 5.7 cases per day.
typeof(pop_denoms_df$geo)
## [1] "character"
length(unique(pop_denoms_df$geo))
## [1] 54
The values in the GEO column of the pop_denoms_df are characters denoting the geopolitical region that a given row’s population data comes from. There are 54 possible values for this column, either a 2-letter country code or a code denoting the entire European Union on a given year. We will eventually merge these 2-letter country codess with the 2-letter country codes in the euro_mpx_cases csv, so the character data type will suffice.
summary(pop_denoms_df$obs_value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31863 2088384 7040272 44417154 19702267 513206391
typeof(pop_denoms_df$obs_value)
## [1] "double"
The values in the obs_value column are the observed population on January 1 of the reported year. These values are doubles representing the total population count of a given region, so we can use these values directly as the denominators in our case rate calculations.
typeof(census_stats_df$country_code)
## [1] "character"
unique(census_stats_df$country_code)
## [1] "AT" "BE" "BG" "CH" "CY" "CZ" "DE" "DK" "EE" "EL" "ES" "FI" "FR" "HR" "HU"
## [16] "IE" "IS" "IT" "LI" "LT" "LU" "LV" "MT" "NL" "NO" "PL" "PT" "RO" "SE" "SI"
## [31] "SK" "UK"
length(unique(census_stats_df$country_code))
## [1] 32
Country Code is a character data type and there are 32 different two-letter country codes in this dataset. We will obtain the different regions from the country_code variable.
unique(census_stats_df$sex)
## [1] "F" "M"
Sex is a character data type, and there are two categories: male, female. We might use sex as a demographic variable to stratify case rates by.
typeof(census_stats_df$age)
## [1] "character"
unique(census_stats_df$age)
## [1] "Y_GE85" "Y_LT15" "Y15-29" "Y30-49" "Y50-64" "Y65-84"
Age is character data type that includes six different age ranges in the dataset: <15, 15-29, 30-49, 65-84 & >85. We might use age as a demographic variable to stratify case rates by.
typeof(census_stats_df$cas)
## [1] "character"
unique(census_stats_df$cas)
## [1] "ACT" "EMP" "INAC" "UNE" "UNK"
Cas is a character data type and represents the employment status of an individual. There are five different statuses, and we might use these to stratify case rates by.
typeof(census_stats_df$edu)
## [1] "character"
unique(census_stats_df$edu)
## [1] "ED1" "ED2" "ED3" "ED4" "ED5" "ED6" "NAP" "NONE" "UNK"
Education status is in a character format, and there are nine different statuses of education. We might use education status to stratify case rates by.
typeof(census_stats_df$res_pop)
## [1] "character"
unique(census_stats_df$res_pop)
## [1] "500000-999999" "10000-99999" "200000-499999" "100000-199999"
## [5] "GE1000000" "1000-9999" "0-1000"
Resident population (res_pop) is in a character data format. There are seven levels of population in each locality of residence.
typeof(census_stats_df$pop)
## [1] "double"
range(census_stats_df$pop)
## [1] 0 1702270
The strata population is in a numeric data format. The number of people in the different location strata ranges from 0 to 1,702,270.
Subset rows or columns, as needed
#acquire region dataset
regions <- read.csv('https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/world_country_regions.csv')
#filtered by europe countries
regions <- regions %>%
filter(region == 'Europe') %>%
select(iso_3166.2, sub.region, region)
#create country code variable to join tables
regions <- regions %>%
mutate(country_code = substr(regions$iso_3166.2,12,13) )
#filtered mpx_cases columns
mpx_cases_df <- mpx_cases_df %>%
select(daterep, countrycode, confcases)
#filter pop_denom for 2011 for census analysis
pop_denoms_2011 <- pop_denoms_df %>%
filter(time_period == 2011) %>%
select(geo, obs_value)
#filtered pop_denom by year and columns
pop_denoms_df <- pop_denoms_df %>%
filter(time_period == 2022) %>%
select(geo, obs_value)
#joined all three datasets together
df <- inner_join(x = regions, y = mpx_cases_df, by = c('country_code' = 'countrycode'))
df <- inner_join(x = df, y = pop_denoms_df, by = c('country_code' = 'geo'))
df <- df %>%
select(-iso_3166.2, -region)
Create new variables needed for analysis (minimum 2)
#created case rate by ten million people
df_grouped <- df %>%
group_by(sub.region, daterep) %>%
summarise(total_cases = sum(confcases), total_obs = sum(obs_value)) %>%
mutate(case_rate_by_tenmillion = total_cases / total_obs * 10000000)
## `summarise()` has grouped output by 'sub.region'. You can override using the
## `.groups` argument.
Clean variables needed for analysis (minimum 2)
#we used an inner join and there were no discrepancies between datasets,
#therefore are no nulls caused by joining
sum(is.na(df_grouped))
## [1] 0
#make sub.region a categorical value
df_grouped <- df_grouped %>%
mutate(sub.region = factor(sub.region))
#make new month column & made it a categorical value
df_grouped <- df_grouped %>%
mutate(month = strftime(daterep,"%B")) %>%
mutate(month = factor(month, levels = c('May','June','July','August')))
str(df_grouped)
## grouped_df [412 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ sub.region : Factor w/ 4 levels "Eastern Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ daterep : Date[1:412], format: "2022-05-09" "2022-05-13" ...
## $ total_cases : num [1:412] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_obs : num [1:412] 89171711 89171711 89171711 89171711 89171711 ...
## $ case_rate_by_tenmillion: num [1:412] 0 0 0 0 0 0 0 0 0 0 ...
## $ month : Factor w/ 4 levels "May","June","July",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ sub.region: Factor w/ 4 levels "Eastern Europe",..: 1 2 3 4
## ..$ .rows : list<int> [1:4]
## .. ..$ : int [1:103] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:103] 104 105 106 107 108 109 110 111 112 113 ...
## .. ..$ : int [1:103] 207 208 209 210 211 212 213 214 215 216 ...
## .. ..$ : int [1:103] 310 311 312 313 314 315 316 317 318 319 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
Data dictionary based on clean dataset (minimum 4 data elements), including: Fields Variable name Data type Description
sub.region: factor, region in europe
daterep: date, year-month-day by week from may to august 2022
total_cases: num, total monkeypox cases in a subregion by week
total_obs: num, total population by subregion
case_rate_by_tenmillion: num, the rate of monkeypox in a subregion by ten million people
month: date, the month from may to august 2022
One or more tables with descriptive statistics for 4 data element
summary(df_grouped)
## sub.region daterep total_cases total_obs
## Eastern Europe :103 Min. :2022-05-09 Min. : 0.0 Min. : 38749061
## Northern Europe:103 1st Qu.:2022-06-07 1st Qu.: 1.0 1st Qu.: 76566048
## Southern Europe:103 Median :2022-07-03 Median : 8.0 Median :106223452
## Western Europe :103 Mean :2022-07-02 Mean : 41.3 Mean :110280452
## 3rd Qu.:2022-07-29 3rd Qu.: 53.0 3rd Qu.:139937856
## Max. :2022-08-23 Max. :764.0 Max. :189925840
## case_rate_by_tenmillion month
## Min. : 0.000 May : 76
## 1st Qu.: 0.158 June :120
## Median : 1.009 July :124
## Mean : 3.048 August: 92
## 3rd Qu.: 3.894
## Max. :40.226
As requested by the leadership of our state health department, we have created a bar chart to visualize monthly MPX case rates for the different regions within the EU.
plot_ly(
df_grouped,
x= ~month,
y= ~case_rate_by_tenmillion,
color= ~sub.region,
type="bar"
) %>%
layout(
title="MPX Case Rates by EU Region, May to August 2022",
yaxis=list(title="Case Rates per 10,000,000"),
xais=list(title="Month"),
paper_bgcolor="white",
plot_bgcolor="white"
)
To evaluate how census data trends for a region compare to that region’s monkeypox rates, we will first calculate the % of the population in each region that is 85 years old or older and the % of the population in each region that is female. Then we will compare these rates against the regions’ monkey pox rates and evaluate any trends we see.
# First, calculate the number of females in each country.
sex_pct_df <- census_stats_df %>%
group_by(country_code, sex) %>%
summarize(sex_count = sum(pop)) %>%
group_by(country_code) %>%
mutate(country_pop = sum(sex_count)) %>%
filter(sex == "F",
country_pop != 0) %>%
rename(female_count = sex_count)
## `summarise()` has grouped output by 'country_code'. You can override using the
## `.groups` argument.
# Then, calculate the number greater than or equal to 85 years old in each country.
age_pct_df <- census_stats_df %>%
group_by(country_code, age) %>%
summarize(age_count = sum(pop)) %>%
group_by(country_code) %>%
mutate(country_pop = sum(age_count)) %>%
filter(age == "Y_GE85",
country_pop != 0) %>%
rename(ge85_count = age_count)
## `summarise()` has grouped output by 'country_code'. You can override using the
## `.groups` argument.
# Join these two df's together.
demog_df <- full_join(x = sex_pct_df, y = age_pct_df, by = "country_code")
demog_df <- demog_df %>%
select(country_code, ge85_count, female_count, country_pop.x) %>%
rename(country_pop = country_pop.x)
# Join the combined df with the regions df so we can map country codes to regions.
##demog_df <- inner_join(x = regions, y = demog_df, by = 'country_code')
# Calculate the % of the population that is GE85 and the % that is Female
demog_df <- demog_df %>%
group_by(country_code) %>%
summarize(pct_ge85 = round(sum(ge85_count) / sum(country_pop) * 100,2),
pct_female = round(sum(female_count) / sum(country_pop) * 100, 2))
# Group the monkeypox case rate df by region and calculate regional case rate
df_case_rate <- df %>%
group_by(country_code) %>%
summarize(total_cases = sum(confcases),
total_obs = sum(obs_value)) %>%
mutate(case_rate_p_mill = total_cases / total_obs * 1000000)
# Join case rates with the regional census data
demog_df <- inner_join(x = demog_df, y = df_case_rate, by = "country_code")
demog_df <- select(demog_df, c(-total_cases, -total_obs))
From this table view, one can see that there likely isn’t a trend between a country’s percent of the population that is female and its monkeypox rate. However, a positive trend is possible between percent of the population that is 85 and older and that country’s monkeypox case rate, however a visual plot will help elucidate the relationship better.
kable(demog_df,longtable=T,booktabs=T,
col.names=c("Country Code", "% 85 and Older", "% Female", "Case Rate per Million" ),
caption="2011 Census Rates versus 2022 Monkeypox Rates by Country") %>%
kable_styling(full_width=F) %>%
kable_styling(position="center") %>%
kable_styling(font_size=10)
| Country Code | % 85 and Older | % Female | Case Rate per Million |
|---|---|---|---|
| AT | 1.58 | 49.79 | 0.2497757 |
| BE | 1.57 | 49.42 | 0.5600969 |
| BG | 1.01 | 50.07 | 0.0056785 |
| CZ | 1.01 | 49.36 | 0.0378501 |
| EE | 1.19 | 52.64 | 0.0728996 |
| FI | 1.43 | 50.38 | 0.0384973 |
| FR | 1.74 | 50.33 | 0.4134357 |
| HR | 1.00 | 50.09 | 0.0625712 |
| HU | 1.15 | 50.69 | 0.0641303 |
| IE | 0.88 | 48.60 | 0.2455963 |
| LU | 1.12 | 48.73 | 0.7070232 |
| LV | 1.08 | 53.17 | 0.0207036 |
| MT | 1.08 | 46.36 | 0.5777114 |
| PL | 0.94 | 49.68 | 0.0311985 |
| PT | 1.51 | 50.91 | 0.7596644 |
| RO | 0.98 | 49.18 | 0.0178487 |
| SI | 1.14 | 48.77 | 0.1981206 |
| SK | 0.76 | 49.54 | 0.0214372 |
ggplot(demog_df, aes(pct_ge85, case_rate_p_mill)) +
geom_point() +
geom_smooth(formula = y ~ x, method = lm, se=FALSE) +
labs(x='Percent of the country that is 85 years and older',
y='Case rate of country per 1,000,000',
title='Monkeypox rates increase in countries with more people who are 85 and older') +
theme_light()
ggplot(demog_df, aes(pct_female, case_rate_p_mill)) +
geom_point() +
geom_smooth(formula = y ~ x, method=lm, se=FALSE) +
labs(x='Percent of the country that is female',
y='Case rate of country per 1,000,000',
title='Monkeypox rates decrease in countries with more females') +
theme_light()